Objective:¶
- Perform Exploratory Data Analysis (EDA) to uncover patterns, correlations, and insights from the dataset.
- Prepare the data for machine learning by handling missing values, scaling, and splitting into training and test sets.
- Build and evaluate a classification model to predict the diabetes outcome based on health-related features.
- Use metrics like accuracy, recall, and F1-score to evaluate the model’s performance.
- Generate visualizations to help better understand feature distributions and model predictions.
Dataset Overview:¶
This dataset consists of 9 columns that provide health and demographic information about patients. Here's a breakdown of each column:¶
Pregnancies:- Represents the number of times the patient has been pregnant.
- Type: Integer
Glucose:- Plasma glucose concentration (measured after 2 hours in an oral glucose tolerance test).
- Type: Float
- Higher glucose levels may indicate poor insulin control.
BloodPressure:- Diastolic blood pressure (mm Hg).
- Type: Float
- Tracks heart health and blood circulation.
SkinThickness:- Triceps skinfold thickness (mm).
- Type: Float
- Acts as an indirect measure of body fat.
Insulin:- 2-hour serum insulin (mu U/ml).
- Type: Float
- Measures insulin function and glucose metabolism.
BMI:- Body mass index (weight in kg/(height in m)^2).
- Type: Float
- Used to measure body fat and overall health.
DiabetesPedigreeFunction:- A function that assesses the likelihood of diabetes based on family history.
- Type: Float
- This is a probabilistic value derived from genetic factors.
Age:- Age of the patient (years).
- Type: Integer
- Age can be a significant factor in diabetes onset.
Outcome:- Target variable indicating whether the patient has diabetes (1) or not (0).
- Type: Integer (Binary Classification)
Import Libraries:¶
In [81]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, recall_score
import warnings as w
w.filterwarnings('ignore')
Load and inspect the dataset:¶
In [2]:
data = pd.read_csv('dibetiese.csv')
data.head()
Out[2]:
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
In [3]:
display(data.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 768 entries, 0 to 767 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pregnancies 768 non-null int64 1 Glucose 768 non-null int64 2 BloodPressure 768 non-null int64 3 SkinThickness 768 non-null int64 4 Insulin 768 non-null int64 5 BMI 768 non-null float64 6 DiabetesPedigreeFunction 768 non-null float64 7 Age 768 non-null int64 8 Outcome 768 non-null int64 dtypes: float64(2), int64(7) memory usage: 54.1 KB
None
In [4]:
print(f"The dataset has {data.shape[0]} rows, and {data.shape[1]} columns.")
The dataset has 768 rows, and 9 columns.
In [5]:
missing_values = data.isnull().sum().sum()
if missing_values > 0:
display(data.isnull().sum())
else:
print(f"Tha dataset has {missing_values} missing values.")
Tha dataset has 0 missing values.
In [6]:
duplicate_values = data.duplicated().sum()
print(f"Tha dataset has {duplicate_values} duplicate values.")
Tha dataset has 0 duplicate values.
In [7]:
display(data.describe())
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
In [8]:
for i in data.columns:
fig = px.box(x = data[i], title=f"{i}")
fig.update_layout(xaxis_title = f'{i}')
fig.show()
In [9]:
def outlier(data, col):
Q1 = data[col].quantile(0.25)
Q3 = data[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outlier_ = data[(data[col] < lower_bound) | (data[col] > upper_bound)]
return len(outlier_)
In [10]:
outliers_ = {}
for i in data.select_dtypes(exclude=['object']):
outliers = outlier(data, i)
outliers_[i] = outliers
outliers = pd.DataFrame(outliers_, index=[0])
outliers
Out[10]:
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 5 | 45 | 1 | 34 | 19 | 29 | 9 | 0 |
Outlier Handling¶
In [11]:
def capping_outliers_with_nan(df, column):
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df[column] = np.where((df[column] < lower_bound) | (df[column] > upper_bound), np.nan, df[column])
for column in data.select_dtypes(exclude='object'):
capping_outliers_with_nan(data, column)
display(data.head())
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 | 1.0 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 0.0 | 26.6 | 0.351 | 31.0 | 0.0 |
| 2 | 8.0 | 183.0 | 64.0 | 0.0 | 0.0 | 23.3 | 0.672 | 32.0 | 1.0 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21.0 | 0.0 |
| 4 | 0.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | NaN | 33.0 | 1.0 |
In [12]:
data.isnull().sum()
Out[12]:
Pregnancies 4 Glucose 5 BloodPressure 45 SkinThickness 1 Insulin 34 BMI 19 DiabetesPedigreeFunction 29 Age 9 Outcome 0 dtype: int64
In [13]:
for i in data.columns:
data[i] = data[i].fillna(data[i].median())
In [14]:
data.describe()
Out[14]:
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.782552 | 121.656250 | 72.196615 | 20.437500 | 59.569010 | 32.198958 | 0.427044 | 32.760417 | 0.348958 |
| std | 3.270644 | 30.438286 | 11.146723 | 15.698554 | 78.415321 | 6.410558 | 0.245323 | 11.055385 | 0.476951 |
| min | 0.000000 | 44.000000 | 38.000000 | 0.000000 | 0.000000 | 18.200000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.750000 | 64.000000 | 0.000000 | 0.000000 | 27.500000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 0.000000 | 32.000000 | 0.356000 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 110.000000 | 36.300000 | 0.582250 | 40.000000 | 1.000000 |
| max | 13.000000 | 199.000000 | 106.000000 | 63.000000 | 318.000000 | 50.000000 | 1.191000 | 66.000000 | 1.000000 |
In [15]:
data.isnull().sum()
Out[15]:
Pregnancies 0 Glucose 0 BloodPressure 0 SkinThickness 0 Insulin 0 BMI 0 DiabetesPedigreeFunction 0 Age 0 Outcome 0 dtype: int64
EDA¶
In [16]:
def classify_features(data):
categorical_features = []
non_categorical_features = []
discrete_features = []
continuous_features = []
for column in data.columns:
if data[column].dtype == 'object':
if data[column].nunique() < 10:
categorical_features.append(column)
else:
non_categorical_features.append(column)
elif data[column].dtype in ['int64','float64']:
if data[column].nunique() < 10:
discrete_features.append(column)
else:
continuous_features.append(column)
return categorical_features, non_categorical_features, discrete_features, continuous_features
categorical_features, non_categorical_features, discrete_features, continuous_features = classify_features(data)
In [17]:
print(f"Categorical Features: {len(categorical_features)}")
Categorical Features: 0
In [18]:
print(f"Non Categorical Features: {len(non_categorical_features)}")
Non Categorical Features: 0
In [19]:
print(f"Discrete Features: {len(discrete_features)}")
print(discrete_features)
Discrete Features: 1 ['Outcome']
In [20]:
print(f"Continuous Features: {len(continuous_features)}")
print(continuous_features)
Continuous Features: 8 ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
In [21]:
for i in continuous_features:
fig = px.histogram(data[i], title=f"{i}",)
fig.show()
In [22]:
for i in continuous_features:
plt.figure(figsize=(16,6))
sns.histplot(data[i], bins = 20, kde = True, palette ='hls')
plt.xticks(rotation = 90)
plt.show()
In [123]:
for i in continuous_features:
print(i)
value_counts = data[i].value_counts().reset_index()
value_counts = pd.DataFrame(value_counts)
display(value_counts)
print()
Pregnancies
| Pregnancies | count | |
|---|---|---|
| 0 | 1.0 | 135 |
| 1 | 0.0 | 111 |
| 2 | 2.0 | 103 |
| 3 | 3.0 | 79 |
| 4 | 4.0 | 68 |
| 5 | 5.0 | 57 |
| 6 | 6.0 | 50 |
| 7 | 7.0 | 45 |
| 8 | 8.0 | 38 |
| 9 | 9.0 | 28 |
| 10 | 10.0 | 24 |
| 11 | 11.0 | 11 |
| 12 | 13.0 | 10 |
| 13 | 12.0 | 9 |
Glucose
| Glucose | count | |
|---|---|---|
| 0 | 99.0 | 17 |
| 1 | 100.0 | 17 |
| 2 | 117.0 | 16 |
| 3 | 129.0 | 14 |
| 4 | 125.0 | 14 |
| ... | ... | ... |
| 130 | 191.0 | 1 |
| 131 | 177.0 | 1 |
| 132 | 44.0 | 1 |
| 133 | 62.0 | 1 |
| 134 | 190.0 | 1 |
135 rows × 2 columns
BloodPressure
| BloodPressure | count | |
|---|---|---|
| 0 | 72.0 | 89 |
| 1 | 70.0 | 57 |
| 2 | 74.0 | 52 |
| 3 | 78.0 | 45 |
| 4 | 68.0 | 45 |
| 5 | 64.0 | 43 |
| 6 | 80.0 | 40 |
| 7 | 76.0 | 39 |
| 8 | 60.0 | 37 |
| 9 | 62.0 | 34 |
| 10 | 66.0 | 30 |
| 11 | 82.0 | 30 |
| 12 | 88.0 | 25 |
| 13 | 84.0 | 23 |
| 14 | 90.0 | 22 |
| 15 | 58.0 | 21 |
| 16 | 86.0 | 21 |
| 17 | 50.0 | 13 |
| 18 | 56.0 | 12 |
| 19 | 54.0 | 11 |
| 20 | 52.0 | 11 |
| 21 | 75.0 | 8 |
| 22 | 92.0 | 8 |
| 23 | 65.0 | 7 |
| 24 | 85.0 | 6 |
| 25 | 94.0 | 6 |
| 26 | 48.0 | 5 |
| 27 | 96.0 | 4 |
| 28 | 44.0 | 4 |
| 29 | 98.0 | 3 |
| 30 | 100.0 | 3 |
| 31 | 106.0 | 3 |
| 32 | 104.0 | 2 |
| 33 | 46.0 | 2 |
| 34 | 55.0 | 2 |
| 35 | 95.0 | 1 |
| 36 | 102.0 | 1 |
| 37 | 61.0 | 1 |
| 38 | 38.0 | 1 |
| 39 | 40.0 | 1 |
SkinThickness
| SkinThickness | count | |
|---|---|---|
| 0 | 0.0 | 227 |
| 1 | 32.0 | 31 |
| 2 | 30.0 | 27 |
| 3 | 23.0 | 23 |
| 4 | 27.0 | 23 |
| 5 | 33.0 | 20 |
| 6 | 28.0 | 20 |
| 7 | 18.0 | 20 |
| 8 | 31.0 | 19 |
| 9 | 19.0 | 18 |
| 10 | 39.0 | 18 |
| 11 | 29.0 | 17 |
| 12 | 40.0 | 16 |
| 13 | 25.0 | 16 |
| 14 | 26.0 | 16 |
| 15 | 22.0 | 16 |
| 16 | 37.0 | 16 |
| 17 | 41.0 | 15 |
| 18 | 35.0 | 15 |
| 19 | 36.0 | 14 |
| 20 | 15.0 | 14 |
| 21 | 17.0 | 14 |
| 22 | 20.0 | 13 |
| 23 | 24.0 | 12 |
| 24 | 42.0 | 11 |
| 25 | 13.0 | 11 |
| 26 | 21.0 | 10 |
| 27 | 46.0 | 8 |
| 28 | 34.0 | 8 |
| 29 | 38.0 | 7 |
| 30 | 12.0 | 7 |
| 31 | 43.0 | 6 |
| 32 | 11.0 | 6 |
| 33 | 16.0 | 6 |
| 34 | 45.0 | 6 |
| 35 | 14.0 | 6 |
| 36 | 10.0 | 5 |
| 37 | 44.0 | 5 |
| 38 | 48.0 | 4 |
| 39 | 47.0 | 4 |
| 40 | 50.0 | 3 |
| 41 | 49.0 | 3 |
| 42 | 8.0 | 2 |
| 43 | 54.0 | 2 |
| 44 | 7.0 | 2 |
| 45 | 52.0 | 2 |
| 46 | 60.0 | 1 |
| 47 | 56.0 | 1 |
| 48 | 51.0 | 1 |
| 49 | 63.0 | 1 |
Insulin
| Insulin | count | |
|---|---|---|
| 0 | 0.0 | 408 |
| 1 | 105.0 | 11 |
| 2 | 130.0 | 9 |
| 3 | 140.0 | 9 |
| 4 | 120.0 | 8 |
| ... | ... | ... |
| 151 | 68.0 | 1 |
| 152 | 29.0 | 1 |
| 153 | 42.0 | 1 |
| 154 | 184.0 | 1 |
| 155 | 112.0 | 1 |
156 rows × 2 columns
BMI
| BMI | count | |
|---|---|---|
| 0 | 32.0 | 32 |
| 1 | 31.6 | 12 |
| 2 | 31.2 | 12 |
| 3 | 32.4 | 10 |
| 4 | 33.3 | 10 |
| ... | ... | ... |
| 235 | 30.7 | 1 |
| 236 | 22.7 | 1 |
| 237 | 45.4 | 1 |
| 238 | 42.0 | 1 |
| 239 | 46.3 | 1 |
240 rows × 2 columns
DiabetesPedigreeFunction
| DiabetesPedigreeFunction | count | |
|---|---|---|
| 0 | 0.356 | 31 |
| 1 | 0.254 | 6 |
| 2 | 0.258 | 6 |
| 3 | 0.238 | 5 |
| 4 | 0.268 | 5 |
| ... | ... | ... |
| 484 | 0.997 | 1 |
| 485 | 0.226 | 1 |
| 486 | 0.612 | 1 |
| 487 | 0.655 | 1 |
| 488 | 0.171 | 1 |
489 rows × 2 columns
Age
| Age | count | |
|---|---|---|
| 0 | 22.0 | 72 |
| 1 | 21.0 | 63 |
| 2 | 25.0 | 48 |
| 3 | 24.0 | 46 |
| 4 | 29.0 | 38 |
| 5 | 23.0 | 38 |
| 6 | 28.0 | 35 |
| 7 | 26.0 | 33 |
| 8 | 27.0 | 32 |
| 9 | 31.0 | 24 |
| 10 | 41.0 | 22 |
| 11 | 30.0 | 21 |
| 12 | 37.0 | 19 |
| 13 | 42.0 | 18 |
| 14 | 33.0 | 17 |
| 15 | 36.0 | 16 |
| 16 | 32.0 | 16 |
| 17 | 38.0 | 16 |
| 18 | 45.0 | 15 |
| 19 | 34.0 | 14 |
| 20 | 46.0 | 13 |
| 21 | 43.0 | 13 |
| 22 | 40.0 | 13 |
| 23 | 39.0 | 12 |
| 24 | 35.0 | 10 |
| 25 | 52.0 | 8 |
| 26 | 44.0 | 8 |
| 27 | 50.0 | 8 |
| 28 | 51.0 | 8 |
| 29 | 58.0 | 7 |
| 30 | 47.0 | 6 |
| 31 | 54.0 | 6 |
| 32 | 48.0 | 5 |
| 33 | 60.0 | 5 |
| 34 | 57.0 | 5 |
| 35 | 49.0 | 5 |
| 36 | 53.0 | 5 |
| 37 | 63.0 | 4 |
| 38 | 66.0 | 4 |
| 39 | 62.0 | 4 |
| 40 | 55.0 | 4 |
| 41 | 65.0 | 3 |
| 42 | 56.0 | 3 |
| 43 | 59.0 | 3 |
| 44 | 61.0 | 2 |
| 45 | 64.0 | 1 |
In [24]:
for i in range(len(continuous_features)):
for j in range(i+1, len(continuous_features)):
plt.figure(figsize=(15,6))
sns.scatterplot(x=continuous_features[i], y=continuous_features[j], data=data, palette = 'hls', hue=data['Outcome'])
plt.title(f'scatter plot of {continuous_features[i]} vs {continuous_features[j]}')
plt.show()
In [25]:
features = data.drop(columns=['Outcome'], axis=1)
sns.pairplot(data, hue='Outcome', vars=features)
plt.show()
In [26]:
correlation_matrix = data[continuous_features].corr()
correlation_matrix
Out[26]:
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 1.000000 | 0.117692 | 0.208953 | -0.096720 | -0.108077 | 0.028339 | 0.004937 | 0.560768 |
| Glucose | 0.117692 | 1.000000 | 0.204539 | 0.060034 | 0.157277 | 0.228245 | 0.080436 | 0.274264 |
| BloodPressure | 0.208953 | 0.204539 | 1.000000 | 0.025645 | -0.049508 | 0.271560 | 0.022533 | 0.326372 |
| SkinThickness | -0.096720 | 0.060034 | 0.025645 | 1.000000 | 0.454830 | 0.373726 | 0.151583 | -0.101397 |
| Insulin | -0.108077 | 0.157277 | -0.049508 | 0.454830 | 1.000000 | 0.163918 | 0.192998 | -0.075614 |
| BMI | 0.028339 | 0.228245 | 0.271560 | 0.373726 | 0.163918 | 1.000000 | 0.123177 | 0.077668 |
| DiabetesPedigreeFunction | 0.004937 | 0.080436 | 0.022533 | 0.151583 | 0.192998 | 0.123177 | 1.000000 | 0.035872 |
| Age | 0.560768 | 0.274264 | 0.326372 | -0.101397 | -0.075614 | 0.077668 | 0.035872 | 1.000000 |
In [27]:
fig = px.imshow(correlation_matrix,
labels=dict(color="Correlation"),
x=correlation_matrix.columns,
y=correlation_matrix.columns, color_continuous_scale='tempo',
title="Correlation Matrix Heatmap",text_auto=True, width=1100, height=700)
fig.show()
In [28]:
threshold = 0.80
highly_correlated_features = set()
for i in range(len(correlation_matrix.columns)):
for j in range(i):
if abs(correlation_matrix.iloc[i, j]) > threshold:
colname = correlation_matrix.columns[i]
highly_correlated_features.add(colname)
highly_correlated_features.add(correlation_matrix.columns[j])
print(f"Highly Correlated Features: {highly_correlated_features}")
Highly Correlated Features: set()
In [29]:
fig = px.histogram(data['Outcome'], title='Outcome', color=data['Outcome'])
fig.show()
In [30]:
x = data.drop(columns=['Outcome'], axis=1)
y = data['Outcome']
smote = SMOTE()
x_smote, y_smote = smote.fit_resample(x, y)
In [31]:
fig = px.histogram(y_smote, color=y_smote, title='SMOTE data')
fig.show()
Logistic Regrassion Models¶
Base Model¶
In [88]:
modelss = {
'Random Forest Classifier':RandomForestClassifier(),
'Decision Tree Regressor':DecisionTreeClassifier(),
'Gradient Boosting Classifier':GradientBoostingClassifier(),
'Logistic Regression':LogisticRegression(),
'AdaBoostClassifier':AdaBoostClassifier(),
'Suport Vector Classifier': SVC(),
'XGBClassifier': XGBClassifier()
}
In [89]:
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']),
])
preprocessor
Out[89]:
ColumnTransformer(transformers=[('num', StandardScaler(),
['Pregnancies', 'Glucose', 'BloodPressure',
'SkinThickness', 'Insulin', 'BMI',
'DiabetesPedigreeFunction', 'Age'])])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
ColumnTransformer(transformers=[('num', StandardScaler(),
['Pregnancies', 'Glucose', 'BloodPressure',
'SkinThickness', 'Insulin', 'BMI',
'DiabetesPedigreeFunction', 'Age'])])['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
StandardScaler()
In [93]:
results = {}
for name, model in modelss.items():
pipeline = Pipeline(steps = [('preprocessor', preprocessor),('model', model)])
x_train, x_test, y_train, y_test = train_test_split(x_smote, y_smote, test_size=0.2, random_state=42)
pipeline.fit(x_train, y_train)
y_test_pred = pipeline.predict(x_test)
y_train_pred = pipeline.predict(x_train)
test_accuracy_score_ = accuracy_score(y_test_pred, y_test)
train_accuracy_score_ = accuracy_score(y_train_pred, y_train)
f1_score_ = f1_score(y_pred, y_test)
recall_score_ = recall_score(y_pred, y_test)
results[name] = {
'Test Accuracy Score': f"{test_accuracy_score_:.2f}",
'Train Accuracy Score': f"{train_accuracy_score_:.2f}"
}
results = pd.DataFrame(results).T
results
Out[93]:
| Test Accuracy Score | Train Accuracy Score | |
|---|---|---|
| Random Forest Classifier | 0.79 | 1.00 |
| Decision Tree Regressor | 0.70 | 1.00 |
| Gradient Boosting Classifier | 0.72 | 0.92 |
| Logistic Regression | 0.70 | 0.77 |
| AdaBoostClassifier | 0.72 | 0.84 |
| Suport Vector Classifier | 0.79 | 0.86 |
| XGBClassifier | 0.77 | 1.00 |
Hyperparameter Tuning¶
In [94]:
param_grids = {
'Random Forest Classifier': {'model__n_estimators': [1,2,3,4,5,6,7,8,9,10]},
'Decision Tree Regressor': {'model__max_depth': [1,2,3,4,5,6,7,8,9,10,20,30]},
'Gradient Boosting Classifier': {'model__n_estimators': [1,2,3,4,5,6,7,8,9,10,20.30]},
'Logistic Regression': {'model__C': [0.001, 0.01, 0.1, 1.0, 10.0]},
'Suport Vector Classifier': {'model__C': [0.001, 0.01, 0.1, 1.0, 10.0]},
'XGBClassifier': {'model__learning_rate': [0.001, 0.01, 0.1, 1.0, 10.0]}
}
In [95]:
classifier_metrics = {'accuracy': accuracy_score, 'f1': f1_score, 'recall': recall_score}
regressor_metrics = {'mse': mean_squared_error}
results = {}
for name, model in modelss.items():
pipeline = Pipeline(steps=[('preprocessor',preprocessor),('model', model)])
param_grid = param_grids.get(name, {})
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
x_train, x_test, y_train, y_test = train_test_split(x_smote, y_smote, test_size=0.2, random_state=42)
grid_search.fit(x_train, y_train)
best_model = grid_search.best_estimator_
y_train_pred = best_model.predict(x_train)
y_test_pred = best_model.predict(x_test)
test_accuracy_score_ = accuracy_score(y_test_pred, y_test)
train_accuracy_score_ = accuracy_score(y_train_pred, y_train)
f1_score_ = f1_score(y_pred, y_test)
recall_score_ = recall_score(y_pred, y_test)
results[name] = {'Best Params': grid_search.best_params_, 'Test Accuracy Score': f"{test_accuracy_score_:.2f}",
'Train Accuracy Score': f"{train_accuracy_score_:.2f}",
'F1 Score': f"{f1_score_:0.2f}", 'Recall Score': f"{recall_score_:0.2f}"}
results_df = pd.DataFrame(results).T
display(results_df)
| Best Params | Test Accuracy Score | Train Accuracy Score | F1 Score | Recall Score | |
|---|---|---|---|---|---|
| Random Forest Classifier | {'model__n_estimators': 9} | 0.78 | 0.99 | 0.80 | 0.78 |
| Decision Tree Regressor | {'model__max_depth': 4} | 0.72 | 0.82 | 0.80 | 0.78 |
| Gradient Boosting Classifier | {'model__n_estimators': 4} | 0.74 | 0.82 | 0.80 | 0.78 |
| Logistic Regression | {'model__C': 10.0} | 0.70 | 0.77 | 0.80 | 0.78 |
| AdaBoostClassifier | {} | 0.72 | 0.84 | 0.80 | 0.78 |
| Suport Vector Classifier | {'model__C': 10.0} | 0.76 | 0.91 | 0.80 | 0.78 |
| XGBClassifier | {'model__learning_rate': 0.1} | 0.74 | 1.00 | 0.80 | 0.78 |
Best Model¶
With Imbalanced Data¶
In [132]:
preprocessor = ColumnTransformer( transformers = [
('num', StandardScaler(), ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']),
])
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('model',LogisticRegression(random_state=55)),
])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
pipeline.fit(x_train,y_train)
y_test_pre = pipeline.predict(x_test)
y_train_pre = pipeline.predict(x_train)
test_accuracy_score_ = accuracy_score(y_test_pre, y_test)
train_accuracy_score_ = accuracy_score(y_train_pre, y_train)
print(f"test accuracy: {test_accuracy_score_ * 100:.2f}%")
print(f"train accuracy: {train_accuracy_score_ * 100:.2f}%")
test accuracy: 74.68% train accuracy: 78.34%
With balanced Data¶
In [130]:
preprocessor = ColumnTransformer( transformers = [
('num', StandardScaler(), ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']),
])
pipeline = Pipeline( steps = [
('preprocessor', preprocessor),
('model',LogisticRegression(random_state=25)),
])
x_train, x_test, y_train, y_test = train_test_split(x_smote, y_smote, test_size=0.2, random_state=42)
pipeline.fit(x_train, y_train)
y_test_pre = pipeline.predict(x_test)
y_train_pre = pipeline.predict(x_train)
test_accuracy_score_ = accuracy_score(y_test_pre, y_test)
train_accuracy_score_ = accuracy_score(y_train_pre, y_train)
print(f"test accuracy: {test_accuracy_score_ * 100:.2f}%")
print(f"train accuracy: {train_accuracy_score_ * 100:.2f}%")
test accuracy: 70.00% train accuracy: 77.12%
In [ ]: